This script borrows heavily from the lecture notes of STAT 471 taught at Wharton School of Business, University of Pennsylvania in the spring of 2022.
In modern data mining, we often encounter situations where a text contains important, useful information about a response of interest. Can we predict how likely a lecture is to be given by a female speaker based on the transcript? One simple but effective way of learning from text is the bag-of-words approach, which converts raw text into a numeric matrix: we first turn each text into a vector of word frequencies. The bag-of-words approach can be extended with n-gram or k-skip-n-gram techniques to account for the semantics of word ordering. We then apply existing methods for numeric matrices to extract useful information or carry out predictions. We will extend the regularization technique (LASSO) to classification problems.
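As a toy illustration of the bag-of-words idea, here is a minimal base-R sketch on two made-up sentences (not the TED data): each document becomes a vector of counts over the shared vocabulary.

```r
# Minimal bag-of-words sketch (hypothetical toy documents, base R only).
docs <- c("ideas worth spreading", "ideas change the world")
tokens <- strsplit(tolower(docs), "\\s+")            # tokenize on whitespace
vocab <- sort(unique(unlist(tokens)))                 # the "bag" of all words
freq <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(freq) <- paste0("doc", seq_along(docs))
freq   # documents x words count matrix; in real corpora most entries are 0
```

In a real corpus the vocabulary is huge and most entries are 0, which is why the later analysis uses sparse matrix tools.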
In this TED case study, we will use the tm package to transform text into a word-frequency matrix. We will build a classifier and conduct sentiment analysis. Finally, we build word clouds to exhibit the words in male speakers' and female speakers' topics/transcripts respectively.
Reference on text mining and web scraping: Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, Wiley, by Munzert, Rubba, Meißner and Nyhuis.
Founded in 1984, TED Conferences, LLC (Technology, Entertainment, Design) is an American-Canadian media organization that posts talks online for free distribution under the slogan “ideas worth spreading”. TED was conceived by Richard Saul Wurman, who co-founded it with Harry Marks in February 1984 as a conference. It has been held annually since 1990. TED’s early emphasis was on technology and design, consistent with its Silicon Valley origins. It has since broadened its perspective to include talks on many scientific, cultural, political, humanitarian, and academic topics.
Goal: How are the contents of TED talks related to speakers’ genders? How well can we predict speakers’ genders based on the content of talks?
Data Description: The dataset contains over 4,000 TED talks in many languages. Let us only look at talks whose native language is English.
Data needed:
- ted_talks_en.csv
- NRC.csv
- ted_en_tm_freq_transcript.csv
- TextMining_glm_small_transcript.RDS
- TextMining_lasso_transcript.RDS

Packages needed:
- tm: a popular text-mining package, created by Ingo Feinerer in 2008 while he was a PhD student at TU Vienna. Users can deploy Hadoop to handle large textual data.
- SnowballC: for stemming
- RColorBrewer: a package for color palettes
- wordcloud: a package for creating word clouds
- glmnet
- randomForest
- ranger
- stringr: a useful string package

My favorite packages (of course, all come from Hadley Wickham...): ggplot2, dplyr, stringr (string manipulations), lubridate (handling date-times). Most of them are aggregated into tidyverse.
The regularization techniques used in regression are readily applied to classification problems. Here we penalize the coefficients while maximizing the likelihood function, or equivalently minimizing the negative log-likelihood.
For a given \(\lambda\) we minimize the penalized negative log-likelihood. Here is the LASSO solution:
\[\min_{\beta_0, \beta_1, \ldots, \beta_p} -\frac{1}{n}\log(\mathcal{Lik}) + \lambda \{ |\beta_1| + |\beta_2| + \cdots + |\beta_p| \} \]
Similarly we obtain the solution for elastic net using the general penalty functions:
\[\Big(\frac{1-\alpha}{2}\Big)\|\beta\|_2^2 + \alpha \|\beta\|_1\]
Remark:
glmnet() uses the following penalized least squares criterion for a continuous response variable:
\[\frac{1}{2n}RSS + \lambda \Big\{ \Big(\frac{1-\alpha}{2}\Big)\|\beta\|_2^2 + \alpha \|\beta\|_1 \Big\}\]
The properties of LASSO (and elastic-net) solutions for logistic regression are similar to those for ordinary least squares.
glmnet() and cv.glmnet() are used in the logistic regression setup as well. We do need to specify family = "binomial". We may also choose among different cross-validation criteria, such as deviance, AUC or misclassification error (MCE). The default is deviance, which means we try to minimize the cross-validated negative log-likelihood.
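A minimal sketch of this setup with simulated data (assuming the glmnet package is installed; in the actual analysis the predictors will be the TED word-frequency matrix):

```r
# Sketch: LASSO logistic regression tuned by cross-validation, on simulated data.
library(glmnet)
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))       # only two informative predictors
fit.cv <- cv.glmnet(X, y, family = "binomial",   # logistic regression
                    type.measure = "deviance",   # or "auc", or "class" (MCE)
                    alpha = 1)                   # alpha = 1: LASSO; 0 < alpha < 1: elastic net
fit.cv$lambda.min                                # lambda minimizing CV deviance
coef(fit.cv, s = "lambda.1se")                   # sparse coefficients at the 1se rule
```

Switching `type.measure` changes only the curve used to pick lambda, not the fitted path itself.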
For the remainder of the lecture, the first step is to do EDA as usual. Then we digitize each text into a high-dimensional vector of word frequencies. glm and LASSO methods will be used to build models of gender based on the topics/transcripts. Finally, we report testing errors to compare the different models.
Important remarks about this lecture:
Because of the high dimension, part of the analysis is run on an even smaller dataset.
The full analysis has been run and the results saved. We will load the results directly for the purpose of demonstration. You may run the specific chunks to reproduce the full results yourself, but it will take some time and may crash your laptop. Once a chunk runs successfully, save its output, then set the chunk to eval=F and load the saved results instead.
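A minimal sketch of this save-then-load workflow in base R (a toy list stands in for an expensive model fit, and a throwaway tempfile stands in for files like TextMining_lasso_transcript.RDS):

```r
# Run the heavy chunk once, cache the result, then set that chunk to eval=F
# and load the cached object in every later knit.
fit <- list(note = "pretend this is an expensive cv.glmnet fit")
cache_file <- tempfile(fileext = ".RDS")
saveRDS(fit, cache_file)        # done once, in the heavy chunk
cached <- readRDS(cache_file)   # done cheaply in every later knit
identical(fit, cached)          # the reloaded object is the same
```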
We have also included other methods such as Random Forest. You may keep them as references for later use.
Using the data.table package to read and manipulate data is much faster than using read.csv(), especially when the dataset is large.
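An illustrative (not rigorous) way to see the speed difference on your own machine, using a throwaway CSV in a temp location:

```r
# Write a medium-sized CSV, then time the two readers on it.
library(data.table)
f <- tempfile(fileext = ".csv")
fwrite(data.table(x = rnorm(1e5), y = sample(letters, 1e5, replace = TRUE)), f)
t.base  <- system.time(read.csv(f))["elapsed"]   # base R reader
t.fread <- system.time(fread(f))["elapsed"]      # data.table reader, typically faster
c(read.csv = t.base, fread = t.fread)
```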
data.all <- fread("data/ted_talks_en.csv", stringsAsFactors = FALSE)
dim(data.all)
object.size(data.all)
nrc<- fread("data/NRC.csv")
afinn<- fread("data/Afinn.csv")
bing<- fread("data/Bing.csv")
options(warn = -1)
## [1] 4005 19
## 48003424 bytes
Let's first take a small piece of the data to work through. We could use fread with the nrows argument to load only the first few rows and avoid reading the entire dataset.
data <- data.all
data = data %>% filter(native_lang=="en") # choose those whose native language is English
names(data)
n <- nrow(data)
data_sub<- data %>%
select(talk_id, title, speaker_1, occupations, about_speakers, recorded_date, published_date, duration, topics, description, views, transcript, comments)
str(data_sub%>%
select(-transcript, -description))
data_sub %>%
select(-transcript)%>%
head()
## [1] "talk_id" "title" "speaker_1" "all_speakers"
## [5] "occupations" "about_speakers" "views" "recorded_date"
## [9] "published_date" "event" "native_lang" "available_lang"
## [13] "comments" "duration" "topics" "related_talks"
## [17] "url" "description" "transcript"
## Classes 'data.table' and 'data.frame': 3957 obs. of 11 variables:
## $ talk_id : int 1 92 7 53 66 49 86 94 71 55 ...
## $ title : chr "Averting the climate crisis" "The best stats you've ever seen" "Simplicity sells" "Greening the ghetto" ...
## $ speaker_1 : chr "Al Gore" "Hans Rosling" "David Pogue" "Majora Carter" ...
## $ occupations : chr "{0: ['climate advocate']}" "{0: ['global health expert; data visionary']}" "{0: ['technology columnist']}" "{0: ['activist for environmental justice']}" ...
## $ about_speakers: chr "{0: 'Nobel Laureate Al Gore focused the world’s attention on the global climate crisis. Now he’s showing us how"| __truncated__ "{0: 'In Hans Rosling’s hands, data sings. Global trends in health and economics come to vivid life. And the big"| __truncated__ "{0: 'David Pogue is the personal technology columnist for the <em>New York Times</em> and a tech correspondent "| __truncated__ "{0: 'Majora Carter redefined the field of environmental equality, starting in the South Bronx at the turn of th"| __truncated__ ...
## $ recorded_date : IDate, format: "2006-02-25" "2006-02-22" ...
## $ published_date: IDate, format: "2006-06-27" "2006-06-27" ...
## $ duration : int 977 1190 1286 1116 1164 1198 992 1485 1262 1538 ...
## $ topics : chr "['alternative energy', 'cars', 'climate change', 'culture', 'environment', 'global issues', 'science', 'sustain"| __truncated__ "['Africa', 'Asia', 'Google', 'demo', 'economics', 'global issues', 'health', 'statistics', 'global development'"| __truncated__ "['computers', 'entertainment', 'interface design', 'media', 'music', 'performance', 'simplicity', 'software', 'technology']" "['MacArthur grant', 'activism', 'business', 'cities', 'environment', 'green', 'inequality', 'politics', 'pollution']" ...
## $ views : int 3523392 14501685 1920832 2664069 65051954 1208138 4636596 3781244 3998282 460994 ...
## $ comments : int 272 628 124 219 4931 48 980 919 930 59 ...
## - attr(*, ".internal.selfref")=<externalptr>
## talk_id title speaker_1
## 1: 1 Averting the climate crisis Al Gore
## 2: 92 The best stats you've ever seen Hans Rosling
## 3: 7 Simplicity sells David Pogue
## 4: 53 Greening the ghetto Majora Carter
## 5: 66 Do schools kill creativity? Sir Ken Robinson
## 6: 49 Behind the design of Seattle's library Joshua Prince-Ramus
## occupations
## 1: {0: ['climate advocate']}
## 2: {0: ['global health expert; data visionary']}
## 3: {0: ['technology columnist']}
## 4: {0: ['activist for environmental justice']}
## 5: {0: ['author', 'educator']}
## 6: {0: ['architect']}
## about_speakers
## 1: {0: 'Nobel Laureate Al Gore focused the world’s attention on the global climate crisis. Now he’s showing us how we’re moving towards real solutions.\\r\\n'}
## 2: {0: 'In Hans Rosling’s hands, data sings. Global trends in health and economics come to vivid life. And the big picture of global development -- with some surprisingly good news -- snaps into sharp focus.'}
## 3: {0: 'David Pogue is the personal technology columnist for the <em>New York Times</em> and a tech correspondent for CBS News. He\\'s also one of the world\\'s bestselling how-to authors, with titles in the For Dummies series and his own line of ""Missing Manual"" books. '}
## 4: {0: 'Majora Carter redefined the field of environmental equality, starting in the South Bronx at the turn of the century. Now she is leading the local economic development movement across the USA.'}
## 5: {0: ""Creativity expert Sir Ken Robinson challenges the way we're educating our children. He champions a radical rethink of our school systems, to cultivate creativity and acknowledge multiple types of intelligence. ""}
## 6: {0: 'Joshua Prince-Ramus is best known as architect of the Seattle Central Library, already being hailed as a masterpiece of contemporary culture. Prince-Ramus was the founding partner of OMA New York—the American affiliate of the Office for Metropolitan Architecture (OMA) in the Netherlands—and served as its Principal until he renamed the firm REX in 2006.'}
## recorded_date published_date duration
## 1: 2006-02-25 2006-06-27 977
## 2: 2006-02-22 2006-06-27 1190
## 3: 2006-02-24 2006-06-27 1286
## 4: 2006-02-26 2006-06-27 1116
## 5: 2006-02-25 2006-06-27 1164
## 6: 2006-02-23 2006-07-10 1198
## topics
## 1: ['alternative energy', 'cars', 'climate change', 'culture', 'environment', 'global issues', 'science', 'sustainability', 'technology']
## 2: ['Africa', 'Asia', 'Google', 'demo', 'economics', 'global issues', 'health', 'statistics', 'global development', 'visualizations', 'math']
## 3: ['computers', 'entertainment', 'interface design', 'media', 'music', 'performance', 'simplicity', 'software', 'technology']
## 4: ['MacArthur grant', 'activism', 'business', 'cities', 'environment', 'green', 'inequality', 'politics', 'pollution']
## 5: ['children', 'creativity', 'culture', 'dance', 'education', 'parenting', 'teaching']
## 6: ['architecture', 'collaboration', 'culture', 'design', 'library']
## description
## 1: With the same humor and humanity he exuded in ""An Inconvenient Truth,"" Al Gore spells out 15 ways that individuals can address climate change immediately, from buying a hybrid to inventing a new, hotter brand name for global warming.
## 2: You've never seen data presented like this. With the drama and urgency of a sportscaster, statistics guru Hans Rosling debunks myths about the so-called ""developing world.""
## 3: New York Times columnist David Pogue takes aim at technology’s worst interface-design offenders, and provides encouraging examples of products that get it right. To funny things up, he bursts into song.
## 4: In an emotionally charged talk, MacArthur-winning activist Majora Carter details her fight for environmental justice in the South Bronx -- and shows how minority neighborhoods suffer most from flawed urban policy.
## 5: Sir Ken Robinson makes an entertaining and profoundly moving case for creating an education system that nurtures (rather than undermines) creativity.
## 6: Architect Joshua Prince-Ramus takes the audience on dazzling, dizzying virtual tours of three recent projects: the Central Library in Seattle, the Museum Plaza in Louisville and the Charles Wyly Theater in Dallas.
## views comments
## 1: 3523392 272
## 2: 14501685 628
## 3: 1920832 124
## 4: 2664069 219
## 5: 65051954 4931
## 6: 1208138 48
#to remove all the brackets, numbers and unnecessary punctuations in the occupations, transcript and topic columns
data_sub<- data_sub%>%
mutate(occupations = str_replace_all(occupations, "\\{|\\#|\\:|\\[|\\'|\\;|\\,|\\]|\\}", "")) %>%
mutate(topics = str_replace_all(topics, "\\{|\\#|\\:|\\[|\\'|\\;|\\,|\\]|\\}", ""))%>%
mutate(about_speakers = str_replace_all(about_speakers, "\\{|\\#|\\:|\\[|\\'|\\;|\\,|\\]|\\}", ""))
#to remove numbers
data_sub$occupations<-data_sub$occupations%>%
removeNumbers()
data_sub$about_speakers<-data_sub$about_speakers%>%
removeNumbers()
# to convert duration from seconds to minutes
data_sub<- data_sub %>%
mutate(duration =round(duration/60, 2))
#to extract the year
data_sub<-data_sub %>%
mutate(year = year(recorded_date))
# published_date
#to remove the name of the speaker from the talk description
data_sub$description <- mapply(function(x,y)gsub(x,"",y) ,gsub(" ", "|",data_sub$speaker_1),data_sub$description)
data_sub<- data_sub %>%
relocate(year)%>%
rename("speaker"="speaker_1")
data_sub %>%
select(-transcript)%>% # not selecting the transcript column so that it is easy to view the data for now
head()
data_sub$first_name = word(data_sub$speaker, 1)
data_sub$first_name[word(data_sub$speaker, 1)=="Sir"] = word(data_sub$speaker[word(data_sub$speaker, 1)=="Sir"], 2) # replace the title to the first name
gender_df <- gender::gender(data_sub$first_name, method = "ssa", years = c(1960,2012)) %>%
select(first_name = name, gender)
gender_df <- unique(gender_df)
data_sub_gender = left_join(data_sub, gender_df, by = "first_name")
Let us make a binary variable female out of the gender variable.
data_sub_gender$female <- c(0)
data_sub_gender$female[data_sub_gender$gender == "female"] <- 1
data_sub_gender$female <- as.factor(data_sub_gender$female)
summary(data_sub_gender$female) #str(data_sub_gender)
Proportion of female speakers:
prop.table(table(data_sub_gender$female))
##
## 0 1
## 0.6502401 0.3497599
Notice that \(35\%\) of the speakers are female.
Do talks relate to the month or the day of the week?
Dealing with dates can be challenging. Should we treat them as continuous variables or categorical ones? This depends heavily on the context and the goal of the study. In our situation, we are interested in knowing whether people tend to give talks over the weekend.
Let us format the dates and extract the weekdays and months:
weekdays <- weekdays(as.Date(data_sub_gender$recorded_date)) # get weekdays for each talk
weekdays_female <- weekdays(as.Date(data_sub_gender[data_sub_gender$female==1]$recorded_date)) # get weekdays for each talk
weekdays_male <- weekdays(as.Date(data_sub_gender[data_sub_gender$female==0]$recorded_date)) # get weekdays for each talk
months <- months(as.Date(data_sub_gender$recorded_date)) # get months
months_female <- months(as.Date(data_sub_gender[data_sub_gender$female==1]$recorded_date)) # get months
months_male <- months(as.Date(data_sub_gender[data_sub_gender$female==0]$recorded_date)) # get months
Do people tend to give a talk over weekends? (Or in particular months?)
par(mfrow=c(1,2))
pie(table(weekdays_male), main="Prop of talks, male") # Pretty much evenly distributed
pie(table(weekdays_female), main="Prop of talks, female") # Pretty much evenly distributed
par(mfrow=c(1,2))
pie(table(months_male))
pie(table(months_female))
Proportion of female talks: we don't really see any pattern.
prop.table(table(data_sub_gender$female, weekdays), 2) # prop of the columns
prop.table(table(data_sub_gender$female, weekdays), 1) # prop of the rows
prop.table(table(data_sub_gender$female, months), 2) # prop of the columns
prop.table(table(data_sub_gender$female, months), 1) # prop of the rows
## weekdays
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0 0.6497797 0.6550580 0.6283186 0.7037037 0.6599045 0.6939655 0.5939227
## 1 0.3502203 0.3449420 0.3716814 0.2962963 0.3400955 0.3060345 0.4060773
## weekdays
## Friday Monday Saturday Sunday Thursday Tuesday
## 0 0.11465216 0.15351729 0.11037699 0.05169063 0.21492421 0.18771862
## 1 0.11496746 0.15039769 0.12147505 0.04049168 0.20607375 0.15401302
## weekdays
## Wednesday
## 0 0.16712009
## 1 0.21258134
## months
## April August December February January July June
## 0 0.6273684 0.7321429 0.5173913 0.7179878 0.5865385 0.6704871 0.6449864
## 1 0.3726316 0.2678571 0.4826087 0.2820122 0.4134615 0.3295129 0.3550136
## months
## March May November October September
## 0 0.7240664 0.6039604 0.5787140 0.6246057 0.6698565
## 1 0.2759336 0.3960396 0.4212860 0.3753943 0.3301435
## months
## April August December February January July
## 0 0.11581811 0.03186941 0.04624951 0.18305480 0.02370773 0.09094442
## 1 0.12798265 0.02169197 0.08026030 0.13376717 0.03109183 0.08315257
## months
## June March May November October September
## 0 0.09249903 0.13563933 0.04741547 0.10143801 0.07695297 0.05441119
## 1 0.09472162 0.09616775 0.05784526 0.13738250 0.08604483 0.04989154
We will use the tidytext library to tokenize the transcripts and find which words occur most frequently in the TED talks, with respect to each gender.
custom_stop<- tribble(
~word, ~lexicon,
"laughter", "custom",
"applause", "custom",
"chris", "custom",
"anderson", "custom")
stop_new<- stop_words%>%
rbind(custom_stop)
Let us now use wordcloud to look at the most frequent words. First, let's see the most frequently used words in the transcripts.
transcript_female <- data_sub_gender %>% filter(female == 1) %>%
unnest_tokens(word, transcript)%>%
anti_join(stop_new, by="word")
transcript_male <- data_sub_gender %>% filter(female == 0) %>%
unnest_tokens(word, transcript)%>%
anti_join(stop_new, by="word")
t_transcript_count_female <- transcript_female %>% count(word)
t_transcript_count_male <- transcript_male %>% count(word)
set.seed(1)
par(mfrow=c(1,2))
wordcloud(t_transcript_count_female$word, t_transcript_count_female$n, max.words=30, color=wes_palette("Darjeeling1"), scale=c(4,0.4), ordered.colors=F, random.order=FALSE, use.r.layout=FALSE)
title("highest occurring in transcripts, female")
options(repr.plot.width = 8, repr.plot.height = 8)
wordcloud(t_transcript_count_male$word, t_transcript_count_male$n, max.words=30, color="darkblue", scale=c(4,0.4), ordered.colors=F, random.order=FALSE, use.r.layout=FALSE)
title("highest occurring in transcripts, male")
options(repr.plot.width = 8, repr.plot.height = 8)
Then let's see the most frequently used words in the topics.
topics_female <- data_sub_gender %>% filter(female == 1) %>%
unnest_tokens(word, topics)%>%
anti_join(stop_new, by="word")
topics_male <- data_sub_gender %>% filter(female == 0) %>%
unnest_tokens(word, topics)%>%
anti_join(stop_new, by="word")
t_topics_count_female <- topics_female %>% count(word)
t_topics_count_male <- topics_male %>% count(word)
set.seed(1)
par(mfrow=c(1,2))
wordcloud(t_topics_count_female$word[1:20], t_topics_count_female$n[1:20], max.words=100, color=wes_palette("Darjeeling1"), scale=c(4,1.2), ordered.colors=F, random.order=FALSE, use.r.layout=FALSE)
title("highest occurring in topics, female")
options(repr.plot.width = 8, repr.plot.height = 8)
wordcloud(t_topics_count_male$word[1:20], t_topics_count_male$n[1:20], max.words=100, color="darkblue", scale=c(4,1.2), ordered.colors=F, random.order=FALSE, use.r.layout=FALSE)
title("highest occurring in topics, male")
options(repr.plot.width = 8, repr.plot.height = 8)
How should we use a transcript or a topic list as predictors? Obviously, some sentences are informative, and the words used convey information as well as sentiment, showing how people feel. We will turn each text into a vector of features, one per word used. The value of a feature for a given document is the frequency (number of occurrences) of that word in the document.
We do this by first collecting all possible words (referred to as the library, or bag, of all words). We then record the frequency of each word used in each text.
tm: use the word term frequency table to transform texts into a word-frequency matrix.
- Bag of words: all the words that appear in the documents, say N of them (in general, N is very large).
- Each document is represented by a vector of N values (notice: most of the entries are 0, as most words do not occur in every document).
- The resulting document-term matrix (dtm) is the input to a later model.
Let's first take a look at the texts we have:
transcript.text <- data_sub_gender$transcript # take the text out
topics.text <- data_sub_gender$topics # take the text out
length(transcript.text)
typeof(transcript.text)
We now extract the text into a word term frequency table.
- VCorpus(): creates a Volatile Corpus
- inspect(): displays detailed information about a corpus
transcript.corpus <- VCorpus(VectorSource(transcript.text))
topics.corpus <- VCorpus(VectorSource(topics.text))
# transcript.corpus
typeof(transcript.corpus) ## It is a list
typeof(topics.corpus) ## It is a list
## [1] "list"
## [1] "list"
Inspect the corpus, say documents number 4 and 5.
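One way to do that, sketched here on a toy corpus (assuming the tm package is installed; in the notes you would index transcript.corpus itself):

```r
# Inspect documents 4 and 5 of a corpus with tm.
library(tm)
corp <- VCorpus(VectorSource(c("first talk", "second talk", "third talk",
                               "fourth talk", "fifth talk")))
lapply(corp[4:5], as.character)   # raw text of documents 4 and 5
inspect(corp[[4]])                # tm's detailed view of a single document
```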
tm_map()
Before transforming the text into a word frequency matrix, we should
transform the text into a more standard format and clean the text by
removing punctuation, numbers and some common words that do not have
predictive power (a.k.a. stopwords;
e.g. pronouns, prepositions, conjunctions). This can be tedious but is
necessary for quality analyses. We use the tm_map()
function with different available transformations including
removeNumbers(), removePunctuation(),
removeWords(), stemDocument(), and
stripWhitespace(). It works like the apply()
class function to apply the function to corpus. Details of each step are
in the appendix. These cleaning techniques are not one-size-fits-all,
and the techniques appropriate for your data will vary based on
context.
# Converts all words to lowercase
transcript.corpus_clean <- tm_map(transcript.corpus, content_transformer(tolower))
# Removes common English stopwords (e.g. "with", "i")
transcript.corpus_clean <- tm_map(transcript.corpus_clean, removeWords, stopwords("english"))
# Removes any punctuation
# NOTE: This step may not be appropriate if you want to account for differences
# on semantics depending on which sentence a word belongs to if you end up
# using n-grams or k-skip-n-grams.
# Instead, periods (or semicolons, etc.) can be replaced with a unique
# token (e.g. "[PERIOD]") that retains this semantic meaning.
transcript.corpus_clean <- tm_map(transcript.corpus_clean, removePunctuation)
# Removes numbers
transcript.corpus_clean <- tm_map(transcript.corpus_clean, removeNumbers)
# Stem words
transcript.corpus_clean <- tm_map(transcript.corpus_clean, stemDocument, lazy = TRUE)
lapply(transcript.corpus_clean[4:5], as.character)
## $`4`
## [1] "today — happi — heard sustain develop will save us howev ted often told real sustain polici agenda just feasibl especi larg urban area like new york citi peopl decisionmak power public privat sector realli feel though danger reason today part dog — abandon puppi found back rain back turn much bigger dog anticip came life fight huge wast facil plan east river waterfront despit fact small part new york citi alreadi handl percent entir citi commerci wast sewag treatment pellet plant sewag sludg plant four power plant world largest fooddistribut center well industri bring diesel truck trip area week area also one lowest ratio park peopl citi contact park depart seedgrant initi help develop waterfront project thought realli wellmean bit naiv live area life get river love facil mention earlier jog dog one morn pull thought just anoth illeg dump weed pile garbag stuff mention kept drag lo behold end lot river knew forgotten littl streetend abandon like dog brought worth save knew grow becom proud begin communityl revit new south bronx just like new dog idea got bigger imagin garner much support along way hunt point riversid park becam first waterfront park south bronx year leverag seed grant time million park fall go exchang marriag vow belov audienc whistl thank much applaus press button back time laughter applaus us live environment justic communiti canari coal mine feel problem right now time environment justic may familiar term goe someth like communiti saddl environment burden less environment benefit unfortun race class extrem reliabl indic one might find good stuff like park tree one might find bad stuff like power plant wast facil black person america twice like white person live area air pollut pose greatest risk health five time like live within walk distanc power plant chemic facil landus decis creat hostil condit lead problem like obes diabet asthma someon leav home go brisk walk toxic neighborhood percent obes rate high even countri diabet come one 
four south bronx children asthma asthma hospit rate seven time higher nation averag impact come everyon way pay dear solid wast cost health problem associ pollut odious cost imprison young black latino men possess untold amount untap potenti fifti percent resid live poverti line percent us unemploy lowincom citizen often use emergencyroom visit primari care come high cost taxpay produc proport benefit poor peopl still poor still unhealthi fortun mani peopl like strive solut compromis live lowincom communiti color short term destroy us long term none us want common els common well first incred goodlook laughter graduat high school colleg postgradu degre travel interest place kid earli teen financi stabl never imprison ok good laughter besid black woman differ way watch near half build neighborhood burn big brother lenni fought vietnam gun block home jesus grew crack hous across street yeah poor black child ghetto thing make differ thing common set apart peopl communiti two world enough heart fight justic thing get differ us late s dad — pullman porter son slave — bought hous hunt point section south bronx year later marri mom time communiti most white workingclass neighborhood dad alon other like pursu version american dream white flight becam common south bronx mani citi around countri redlin use bank wherein certain section citi includ deem limit sort invest mani landlord believ profit torch build collect insur money rather sell condit — dead injur former tenant notwithstand hunt point former walkwork communiti now resid neither work home walk nation highway construct boom ad problem new york state robert mose spearhead aggress highwayexpans campaign one primari goal make easier resid wealthi communiti westchest counti go manhattan south bronx lie stand chanc resid often given less month notic build raze peopl displac common percept pimp pusher prostitut south bronx told earliest day noth good go come communiti bad ugli reflect now famili properti worthless save 
home luckili home love insid along help teacher mentor friend along way enough now stori import plan perspect econom degrad beget environment degrad beget social degrad disinvest began s set stage environment injustic come antiqu zone landus regul still use day continu put pollut facil neighborhood factor taken consider landus polici decid cost associ decis pay profit anyth justifi local communiti goe plan — quot — best interest mind realiz decid time plan small park told earlier first stage build greenway movement south bronx wrote onequartermillion dollar feder transport grant design plan waterfront esplanad dedic street bike path physic improv help inform public polici regard traffic safeti placement wast facil done proper compromis communiti qualiti life provid opportun physic activ well local econom develop think bike shop juic stand secur million dollar build firstphas project lafayett avenu — redesign mathew nielsen landscap architect path construct ll connect south bronx acr randal island park right now separ feet water link will chang nurtur natur environ abund will give us back even run project call bronx environment stewardship train provid job train field ecolog restor folk communiti skill compet wellpay job littl littl seed area greencollar job — peopl financi person stake environ sheridan expressway underutil relic robert mose era built regard neighborhood divid even rush hour goe virtual unus communiti creat altern transport plan allow remov highway opportun now bring togeth stakehold reenvis acr can better util parkland afford hous local econom develop also built new york citi first green cool roof demonstr project top offic cool roof highlyreflect surfac absorb solar heat pass build atmospher green roof soil live plant can use instead petroleumbas roof materi absorb heat contribut urban heat island effect degrad sun turn breath green roof also retain percent rainfal reduc citi need fund cost endpip solut — incident often locat environment justic 
communiti like mine provid habitat littl friend butterfli laughter cool anyway demonstr project springboard green roof instal busi bring job sustain econom activ south bronx green new black laughter applaus like anyway know chris told us pitch sinc attent need investor end pitch better ask forgiv permiss anyway — laughter applaus ok katrina prior katrina south bronx new orlean ninth ward lot common larg popul poor peopl color hotb cultur innov think hiphop jazz waterfront communiti host industri resid close proxim one anoth postkatrina era still common best ignor malign abus worst neglig regulatori agenc pernici zone lax government account neither destruct ninth ward south bronx inevit emerg valuabl lesson dig simpli nation symbol urban blight problem solv empti campaign promis presid come gone now will let gulf coast languish decad two like south bronx will take proactiv step learn homegrown resourc grassroot activist born desper communiti like mine now listen expect individu corpor govern make world better place right moral present today repres like tini littl bit clue tell later want know laughter — know bottom line one percept motiv peopl end interest like call tripl bottom line sustain develop can produc develop potenti creat posit return concern develop govern communiti project go present happen new york citi oper comprehens urbanplan deficit parad govern subsidi go propos bigbox stadium develop south bronx scant coordin citi agenc deal cumul effect increas traffic pollut solid wast impact open space approach local econom job develop lame even funni top world richest sport team replac hous ruth built destroy two welllov communiti park now even less stat told earlier although less percent south bronx resid car project includ thousand new park space yet zip term mass public transit now miss larger debat comprehens costbenefit analysi fix unhealthi environmentallychalleng communiti versus incorpor structur sustain chang agenc work close columbia univers other 
shine light issu now get straight antidevelop citi wilder preserv embrac inner capitalist — laughter probabl need laughter problem develop make money enough preced show sustain communityfriend develop can still make fortun fellow tedster bill mcdonough amori lovin — hero mine way — shown can actual problem develop hyperexploit polit vulner communiti profit continu shame upon us respons futur creat one thing remind greater possibl learn visionari citi version global take bogota poor latino surround runaway gun violenc drug traffick reput unlik south bronx howev citi bless late s highlyinfluenti mayor name enriqu peñalosa look demograph bogotano car yet huge portion citi resourc dedic serv mayor can someth administr narrow key municip thoroughfar five lane three outlaw park street expand pedestrian walkway bike lane creat public plaza creat one effici bus masstransit system entir world brilliant effort near impeach peopl began see put first issu reflect dayday live incred thing happen peopl stop litter crime rate drop street aliv peopl administr attack sever typic urban problem one time thirdworld budget excus countri sorri bottom line peoplefirst agenda meant penal actual afford car rather provid opportun bogotano particip citi resurg develop come expens major popul still consid radic idea us bogota exampl power chang howev bless gift influenc valu inform exchang use influenc support comprehens sustain chang everywher just talk ted nationwid polici agenda tri build know polit person help make green new black help make sustain sexi make part dinner cocktail convers help fight environment econom justic support invest triplebottomlin return help democrat sustain bring everyon tabl insist comprehens plan can address everywher oh good glad littl time listen — spoke mr gore day breakfast ask environment justic activist go includ new market strategi respons grant program think understood ask fund make offer applaus troubl top approach still around now get wrong need money 
laughter grassroot group need tabl decisionmak process percent energi mr gore remind us wast everi day add wast energi intellig hardearn experi count applaus come far meet like pleas wast work togeth can becom one small rapidlygrow group individu actual audac courag believ actual can chang world might come confer differ station life believ share one incred power thing noth lose everyth gain ciao bello applaus"
##
## $`5`
## [1] "good morn audienc good great blown away whole thing fact leav laughter three theme run confer relev want talk one extraordinari evid human creativ present peopl just varieti rang second put us place idea go happen term futur idea may play interest educ actual find everybodi interest educ find interest dinner parti say work educ — actual often dinner parti frank laughter work educ ask laughter never ask back curious strang say somebodi know say say work educ can see blood run face like oh god laughter one night week laughter ask educ pin wall one thing goe deep peopl right like religion money thing big interest educ think huge vest interest part educ meant take us futur grasp think children start school year will retir nobodi clue despit expertis parad past four day world will look like five year time yet meant educ unpredict think extraordinari third part agre nonetheless realli extraordinari capac children — capac innov mean sirena last night marvel just see except think speak except whole childhood person extraordinari dedic found talent content kid tremend talent squander pretti ruthless want talk educ want talk creativ content creativ now import educ literaci treat status applaus thank applaus way thank much laughter minut left laughter well born laughter heard great stori recent — love tell — littl girl draw lesson six back draw teacher said girl hard ever paid attent draw lesson teacher fascin went said draw girl said draw pictur god teacher said nobodi know god look like girl said will minut laughter son four england — actual four everywher honest laughter strict wherev went four year nativ play rememb stori laughter big big stori mel gibson sequel may seen laughter nativ ii jame got part joseph thrill consid one lead part place cram full agent tshirt jame robinson joseph laughter speak know bit three king come come bear gift gold frankincens myrrh realli happen sit think just went sequenc talk littl boy afterward said ok said yeah wrong just switch 
three boy came fouryearold tea towel head put box first boy said bring gold second boy said bring myrrh third boy said frank sent laughter thing common kid will take chanc know go right frighten wrong mean say wrong thing creativ know prepar wrong never come anyth origin — prepar wrong time get adult kid lost capac becom frighten wrong run compani like stigmat mistak now run nation educ system mistak worst thing can make result educ peopl creativ capac picasso said said children born artist problem remain artist grow believ passion grow creativ grow rather get educ live stratfordavon five year ago fact move stratford los angel can imagin seamless transit laughter actual live place call snitterfield just outsid stratford shakespear father born struck new thought think shakespear father think shakespear child shakespear seven never thought mean seven point somebodi english class laughter annoy laughter must tri harder laughter sent bed dad shakespear go bed now william shakespear put pencil laughter stop speak like laughter confus everybodi laughter anyway move stratford los angel just want say word transit actual son want come got two kid now daughter want come los angel love girlfriend england love life sarah known month laughter mind fourth anniversari long time realli upset plane said never find anoth girl like sarah rather pleas frank — laughter main reason leav countri laughter someth strike move america travel around world everi educ system earth hierarchi subject everi one matter go think otherwis top mathemat languag human bottom art everywher earth pretti much everi system hierarchi within art art music normal given higher status school drama danc educ system planet teach danc everi day children way teach mathemat think rather import think math import danc children danc time allow bodi miss meet laughter truth happen children grow start educ progress waist focus head slight one side visit educ alien say public educ think conclud look output realli succeed 
everyth get browni point winner — think conclud whole purpos public educ throughout world produc univers professor peopl come top use one laughter like univers professor know hold highwat mark human achiev just form life anoth form life rather curious say affect someth curious professor experi — typic — live head live slight one side disembodi know kind liter way look upon bodi form transport head laughter way get head meet laughter want real evid bodi experi way get along residenti confer senior academ pop discothequ final night laughter will see grown men women writh uncontrol beat laughter wait end can go home write paper laughter educ system predic idea academ abil reason around world public system educ realli th centuri came meet need industri hierarchi root two idea number one use subject work top probabl steer benign away thing school kid thing like ground never get job right music go musician art artist benign advic — now profound mistaken whole world engulf revolut second academ abil realli come domin view intellig univers design system imag think whole system public educ around world protract process univers entranc consequ mani high talent brilliant creativ peopl think thing good school valu actual stigmat think afford go way next year accord unesco peopl worldwid will graduat educ sinc begin histori peopl combin thing talk technolog transform effect work demographi huge explos popul sudden degre worth anyth true student degre job job want one want one frank laughter now kid degre often head home carri play video game need ma previous job requir ba now need phd process academ inflat indic whole structur educ shift beneath feet need radic rethink view intellig know three thing intellig one divers think world way experi think visual think sound think kinesthet think abstract term think movement second intellig dynam look interact human brain heard yesterday number present intellig wonder interact brain divid compart fact creativ — defin process origin idea 
valu — often come interact differ disciplinari way see thing way shaft nerv join two halv brain call corpus callosum thicker women follow helen yesterday probabl women better multitask raft research know person life wife cook meal home often thank laughter good thing cook deal peopl phone talk kid paint ceil — laughter openheart surgeri cook door shut kid phone hook come get annoy say terri pleas tri fri egg laughter give break laughter actual know old philosoph thing tree fall forest nobodi hear happen rememb old chestnut saw great tshirt recent said man speak mind forest woman hear still wrong laughter third thing intellig distinct new book moment call epiphani base seri interview peopl discov talent fascin peopl got realli prompt convers wonder woman mayb peopl never heard gillian lynn heard choreograph everybodi know work cat phantom opera wonder use board royal ballet can see laughter gillian lunch one day said get dancer interest school realli hopeless school s wrote parent said think gillian learn disord concentr fidget think now say adhd s adhd invent point avail condit laughter peopl awar laughter anyway went see specialist oakpanel room mother led sat chair end sat hand minut man talk mother problem gillian school disturb peopl homework alway late littl kid eight end doctor went sat next gillian said listen thing mother told need speak privat wait back long went left went room turn radio sit desk got room said mother just stand watch minut left room feet move music watch minut turn mother said mrs lynn gillian sick dancer take danc school said happen said tell wonder walk room full peopl like — peopl sit still peopl move think move think ballet tap jazz modern contemporari eventu audit royal ballet school becam soloist wonder career royal ballet eventu graduat royal ballet school found gillian lynn danc compani met andrew lloyd webber respons success music theater product histori given pleasur million multimillionair somebodi els might put medic told calm 
applaus think come al gore spoke night ecolog revolut trigger rachel carson believ hope futur adopt new concept human ecolog one start reconstitut concept rich human capac educ system mine mind way stripmin earth particular commod futur serv us rethink fundament principl educ children wonder quot jona salk said insect disappear earth within year life earth end human be disappear earth within year form life flourish right ted celebr gift human imagin care now use gift wise avert scenario talk way see creativ capac rich see children hope task educ whole can face futur way — may see futur will job help make someth thank much applaus"
# for topics
# Converts all words to lowercase
topics.corpus_clean <- tm_map(topics.corpus, content_transformer(tolower))
# Removes common English stopwords (e.g. "with", "i")
topics.corpus_clean <- tm_map(topics.corpus_clean, removeWords, stopwords("english"))
# Removes any punctuation
# NOTE: This step may not be appropriate if you want to account for differences
# on semantics depending on which sentence a word belongs to if you end up
# using n-grams or k-skip-n-grams.
# Instead, periods (or semicolons, etc.) can be replaced with a unique
# token (e.g. "[PERIOD]") that retains this semantic meaning.
topics.corpus_clean <- tm_map(topics.corpus_clean, removePunctuation)
# Removes numbers
topics.corpus_clean <- tm_map(topics.corpus_clean, removeNumbers)
# Stem words
topics.corpus_clean <- tm_map(topics.corpus_clean, stemDocument, lazy = TRUE)
lapply(topics.corpus_clean[4:5], as.character)
## $`4`
## [1] "macarthur grant activ busi citi environ green inequ polit pollut"
##
## $`5`
## [1] "children creativ cultur danc educ parent teach"
Now we transform each transcript into a word frequency matrix using
the function DocumentTermMatrix().
transcript.dtm1 <- DocumentTermMatrix( transcript.corpus_clean ) ## library = collection of words for all documents
class(transcript.dtm1)
## [1] "DocumentTermMatrix"   "simple_triplet_matrix"
inspect(transcript.dtm1) # typeof(dtm1) #length(dimnames(dtm1)$Terms)
## <<DocumentTermMatrix (documents: 3957, terms: 52898)>>
## Non-/sparse entries: 1583939/207733447
## Sparsity : 99%
## Maximal term length: 42
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can get just know like now one peopl thing think
## 2318 41 17 40 22 40 31 25 75 30 71
## 2349 27 20 41 11 39 16 22 77 18 68
## 348 26 27 13 24 22 37 38 65 34 21
## 3723 60 15 29 26 22 25 33 47 24 29
## 3913 43 58 29 70 24 26 26 80 41 16
## 3918 68 58 23 43 38 43 40 63 26 63
## 3919 37 32 33 47 32 62 38 79 35 74
## 3926 90 25 49 32 33 60 43 62 37 80
## 3930 23 40 35 29 32 12 25 60 41 75
## 3935 40 36 25 56 31 20 22 44 36 58
topics.dtm1 <- DocumentTermMatrix( topics.corpus_clean ) ## library = collection of words for all documents
class(topics.dtm1)
## [1] "DocumentTermMatrix"   "simple_triplet_matrix"
inspect(topics.dtm1) # typeof(dtm1) #length(dimnames(dtm1)$Terms)
## <<DocumentTermMatrix (documents: 3957, terms: 459)>>
## Non-/sparse entries: 34492/1781771
## Sparsity : 98%
## Maximal term length: 14
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs anim chang cultur design global health human scienc social technolog
## 1134 0 0 0 2 0 1 0 2 0 1
## 1270 0 1 0 1 0 0 0 1 0 1
## 2090 0 1 1 0 2 0 1 0 0 0
## 2108 0 1 1 0 2 0 1 0 1 0
## 2118 0 2 0 0 2 0 1 0 1 1
## 2211 0 2 0 0 2 0 0 0 1 1
## 2241 0 0 0 0 2 0 0 0 0 0
## 2302 0 0 1 0 1 0 2 1 0 1
## 2367 0 1 0 3 1 0 1 1 1 1
## 2401 0 1 0 0 2 1 1 0 2 0
Take a look at the dtm.
colnames(transcript.dtm1)[7150:7161] # a few of the words in the bag
# another way to get the list of words
# dimnames(transcript.dtm1)$Terms[7150:7161]
dim(as.matrix(transcript.dtm1)) # number of documents by number of words used as predictors
colnames(topics.dtm1)[300:310] # a few of the words in the bag
# another way to get the list of words
# dimnames(topics.dtm1)$Terms[300:310]
dim(as.matrix(topics.dtm1)) # number of documents by number of words used as predictors
Document 1 is row 1 in the dtm.
inspect(transcript.dtm1[1,]) #Non-/sparse entries: number of non-zero entries vs. number of zero entries
inspect(topics.dtm1[1,]) #Non-/sparse entries: number of non-zero entries vs. number of zero entries
Document 1 in transcript.dtm1 has 42 distinct words; in other words, there are 42 non-zero cells out of the 52898-word bag. In topics.dtm1 it has 14 distinct words; in other words, there are 14 non-zero cells out of the 459-word bag.
as.matrix(transcript.dtm1[1, 1:50]) # most of the cells are 0
as.matrix(topics.dtm1[1, 1:50]) # most of the cells are 0
This is because document 1 consists of only a handful of words after all the cleaning.
sum(as.matrix(transcript.dtm1[1,]))
sum(as.matrix(topics.dtm1[1,]))
We may list the words that actually appear in document 1:
colnames(as.matrix(transcript.dtm1[1, ]))[which(as.matrix(transcript.dtm1[1, ]) != 0)]
colnames(as.matrix(topics.dtm1[1, ]))[which(as.matrix(topics.dtm1[1, ]) != 0)]
as.character(transcript.corpus[[1]]) # original text
as.character(topics.corpus[[1]]) # original text
We inspect a few rows and columns of dtm1.
inspect(transcript.dtm1[1:5, 1000:1010]) # very sparse
inspect(topics.dtm1[1:5, 300:310]) # very sparse
Many words do not appear nearly as often as others. If the cleaning was done appropriately, dropping such rare words should not lose much information. So we first cut the bag to include only words whose total frequency is at least 0.5% of the number of documents (or a threshold of your choice). This reduces the dimension of the extracted feature matrix to be analyzed.
transcript.threshold <- .005*length(transcript.corpus_clean) # 0.5% of the total number of documents
transcript.words.05 <- findFreqTerms(transcript.dtm1, lowfreq=transcript.threshold) # words appearing at least `threshold` times in total
length(transcript.words.05)
transcript.words.05[580:600]
topics.threshold <- .005*length(topics.corpus_clean) # 0.5% of the total number of documents
topics.words.05 <- findFreqTerms(topics.dtm1, lowfreq=topics.threshold) # words appearing at least `threshold` times in total
length(topics.words.05)
topics.words.05[200:210]
transcript.dtm.05 <- DocumentTermMatrix(transcript.corpus_clean, control = list(dictionary = transcript.words.05))
dim(as.matrix(transcript.dtm.05))
colnames(transcript.dtm.05)[40:50]
topics.dtm.05 <- DocumentTermMatrix(topics.corpus_clean, control = list(dictionary = topics.words.05))
dim(as.matrix(topics.dtm.05))
colnames(topics.dtm.05)[40:50]
removeSparseTerms(): Another way to reduce the size of the bag is to use removeSparseTerms().
transcript.dtm.05.2 <- removeSparseTerms(transcript.dtm1, 1-.005) # control sparsity < .995
inspect(transcript.dtm.05.2)
# colnames(dtm.05.2)[1:50]
# words that are in dtm.05 but not in dtm.05.2
colnames(transcript.dtm.05)[!(colnames(transcript.dtm.05) %in% colnames(transcript.dtm.05.2))]
topics.dtm.05.2 <- removeSparseTerms(topics.dtm1, 1-.005) # control sparsity < .995
inspect(topics.dtm.05.2)
# colnames(dtm.05.2)[1:50]
# words that are in dtm.05 but not in dtm.05.2
colnames(topics.dtm.05)[!(colnames(topics.dtm.05) %in% colnames(topics.dtm.05.2))]
We end up with two different bags because:
findFreqTerms() counts a word multiple times if it appears multiple times in one document.
removeSparseTerms() keeps words that appear at least once in X% of the documents.
DTM
We consolidate all the processing steps into the following clean R-chunk, turning texts (input) into a document-term matrix of term frequencies, a sparse matrix (output), to be used in the downstream analyses.
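To make the contrast between the two pruning rules concrete, here is a small sketch on a three-document toy corpus (the documents and thresholds are made up purely for illustration):

```r
library(tm)

# Toy corpus: "apple" occurs 3 times but only in one document;
# "banana" occurs twice, once in each of two documents.
toy <- VCorpus(VectorSource(c("apple apple apple", "banana", "banana cherry")))
toy.dtm <- DocumentTermMatrix(toy)

# findFreqTerms() thresholds on TOTAL counts, so repeated occurrences
# inside a single document all count toward the threshold:
findFreqTerms(toy.dtm, lowfreq = 2)   # "apple" and "banana" pass

# removeSparseTerms() thresholds on the FRACTION OF DOCUMENTS containing
# the term: "apple" and "cherry" each appear in only 1 of 3 documents
# (sparsity 2/3 > 0.5), so only "banana" survives.
Terms(removeSparseTerms(toy.dtm, sparse = 0.5))
```

Here "apple" makes it into the first bag but not the second, which is exactly the discrepancy observed above.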
All the tm_map() steps can be performed inside DocumentTermMatrix() through the parameter called control. Here is how.
# Turn texts to corpus
transcript.corpus <- VCorpus(VectorSource(transcript.text))
# Control list for creating our DTM within DocumentTermMatrix
# Can tweak settings based off if you want punctuation, numbers, etc.
control_list <- list( tolower = TRUE,
removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = stopwords("english"),
# stopwords = TRUE,
stemming = TRUE)
# dtm with all terms:
transcript.dtm.05.long <- DocumentTermMatrix(transcript.corpus, control = control_list)
#inspect(dtm.05.long)
# kick out rare words
transcript.dtm.05<- removeSparseTerms(transcript.dtm.05.long, 1-.005)
inspect(transcript.dtm.05)
## <<DocumentTermMatrix (documents: 3957, terms: 6916)>>
## Non-/sparse entries: 1495946/25870666
## Sparsity : 95%
## Maximal term length: 16
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can get just know like now one peopl thing think
## 2318 41 17 40 22 40 31 25 75 30 71
## 2349 27 20 41 11 39 16 22 77 18 68
## 348 26 27 13 24 22 37 38 65 34 21
## 3723 60 15 29 26 22 25 33 47 24 29
## 3913 43 58 29 70 24 26 26 80 41 16
## 3918 68 58 23 43 38 43 40 63 26 63
## 3919 37 32 33 47 32 62 38 79 35 74
## 3926 90 25 49 32 33 60 43 62 37 80
## 3930 23 40 35 29 32 12 25 60 41 75
## 3935 40 36 25 56 31 20 21 44 36 58
# look at the document 1 before and after cleaning
# inspect(transcript.corpus[[1]])
# after cleaning
# colnames(as.matrix(transcript.dtm1[1, ]))[which(as.matrix(transcript.dtm1[1, ]) != 0)]
topics.corpus <- VCorpus(VectorSource(topics.text))
# Control list for creating our DTM within DocumentTermMatrix
# Can tweak settings based off if you want punctuation, numbers, etc.
control_list <- list( tolower = TRUE,
removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = stopwords("english"),
# stopwords = TRUE,
stemming = TRUE)
# dtm with all terms:
topics.dtm.05.long <- DocumentTermMatrix(topics.corpus, control = control_list)
#inspect(dtm.05.long)
# kick out rare words
topics.dtm.05<- removeSparseTerms(topics.dtm.05.long, 1-.005)
inspect(topics.dtm.05)
## <<DocumentTermMatrix (documents: 3957, terms: 288)>>
## Non-/sparse entries: 32698/1106918
## Sparsity : 97%
## Maximal term length: 13
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs anim chang cultur design global health human scienc social technolog
## 1270 0 1 0 1 0 0 0 1 0 1
## 2090 0 1 1 0 2 0 1 0 0 0
## 2108 0 1 1 0 2 0 1 0 1 0
## 2118 0 2 0 0 2 0 1 0 1 1
## 2191 0 2 0 0 2 0 1 0 1 0
## 2302 0 0 1 0 1 0 2 1 0 1
## 2318 0 2 0 0 2 0 2 0 1 1
## 2367 0 1 0 3 1 0 1 1 1 1
## 2401 0 1 0 0 2 1 1 0 2 0
## 3234 1 0 2 0 0 0 1 0 0 0
# look at the document 1 before and after cleaning
# inspect(topics.corpus[[1]])
# after cleaning
# colnames(as.matrix(topics.dtm1[1, ]))[which(as.matrix(topics.dtm1[1, ]) != 0)]
We have included n-grams here as a reference and also show how to get the corresponding DTM, but we will skip this section.
The approach we have taken with bag of words may seem very naive. What we have done is essentially looked at the frequencies of word occurrences in each document while throwing out any of the grammatical structure related to word ordering, punctuation, etc.
Consider two sentences such as "This was the worst talk I have ever seen" and "This talk was far from the worst I have ever seen" (illustrative examples).
When we read these, we realize that the ordering of the words plays an important role in the meaning. While both sentences might be treated similarly by bag-of-words, each having one occurrence of the word "worst", we know from the surrounding context that the first sentence is negative while the second is positive.
Fortunately, there are extensions to bag-of-words that let us account for these nuances related to word ordering and related semantics! Instead of looking at the frequencies of individual words, what if we look at the frequencies of pairs, triples, or even n consecutive words? N-grams do exactly this, typically looking at the number of occurrences of all n-tuples of consecutive words from all of our documents of interest. There are even further generalizations like skip-grams that let us look at consecutive sequences of words with some gaps in the middle, but this is beyond anything that you will be required to do in the class (though it could be interesting to try for your final project!).
While this gives the added benefit of capturing more nuances in the language, you can probably imagine that this would blow up our feature space really quickly, so in practice most people don’t go beyond bi- or tri-grams. Like we had in the case of bag-of-words, most of our n-grams are also going to be pretty rare, especially for higher values of n, so we should keep in mind that our feature matrix is going to be even more sparse than before.
An example of bigrams on the sentence “I love data science” is “I love”, “love data”, and “data science”. Generally, we also compute the 1-grams as well, which coincide with the bag-of-words from before.
Using the tm package along with a custom tokenizer, we
can now implement n-grams using R! As is, the tokenizer is set to
produce bigrams but you should be able to change this by tweaking the
variable n.
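As a sketch of what such a tokenizer could look like, here is a minimal sliding-window version in base R (the names `NgramTokenizer` and `n` are our own choices, not part of tm):

```r
# Sliding-window n-gram tokenizer; set n = 3 for trigrams, etc.
n <- 2
NgramTokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

NgramTokenizer("i love data science")
# [1] "i love"       "love data"    "data science"

# It can then be plugged into the DTM construction, e.g.:
# library(tm)
# dtm.bigram <- DocumentTermMatrix(transcript.corpus_clean,
#                                  control = list(tokenize = NgramTokenizer))
```

Passing the tokenizer through the control list replaces tm's default word tokenizer, so the resulting DTM columns are bigrams rather than single words.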
We next prepare a clean corpus
transcript.corpus <- VCorpus(VectorSource(transcript.text))
topics.corpus <- VCorpus(VectorSource(topics.text))
# control_list <- list( tolower = TRUE,
# removePunctuation = TRUE,
# removeNumbers = TRUE,
# stopwords = stopwords("english"),
# stemming = TRUE)
# # dtm with all terms:
# dtm.10.long <- DocumentTermMatrix(transcript.corpus, control = control_list)
# Converts all words to lowercase
transcript.corpus_clean <- tm_map(transcript.corpus, content_transformer(tolower))
topics.corpus_clean <- tm_map(topics.corpus, content_transformer(tolower))
# Removes common English stopwords (e.g. "with", "i")
transcript.corpus_clean <- tm_map(transcript.corpus_clean, removeWords, stopwords("english"))
topics.corpus_clean <- tm_map(topics.corpus_clean, removeWords, stopwords("english"))
# Removes any punctuation
# NOTE: This step may not be appropriate if you want to account for differences
# on semantics depending on which sentence a word belongs to if you end up
# using n-grams or k-skip-n-grams.
# Instead, periods (or semicolons, etc.) can be replaced with a unique
# token (e.g. "[PERIOD]") that retains this semantic meaning.
transcript.corpus_clean <- tm_map(transcript.corpus_clean, removePunctuation)
topics.corpus_clean <- tm_map(topics.corpus_clean, removePunctuation)
# Removes numbers
transcript.corpus_clean <- tm_map(transcript.corpus_clean, removeNumbers)
topics.corpus_clean <- tm_map(topics.corpus_clean, removeNumbers)
# Stem words
transcript.corpus_clean <- tm_map(transcript.corpus_clean, stemDocument, lazy = TRUE)
topics.corpus_clean <- tm_map(topics.corpus_clean, stemDocument, lazy = TRUE)
transcript.text_clean <- data.frame(text = unlist(sapply(transcript.corpus_clean, `[`, "content")),
stringsAsFactors=F)
topics.text_clean <- data.frame(text = unlist(sapply(topics.corpus_clean, `[`, "content")),
stringsAsFactors=F)
data_sub_gender_clean = data.frame(data_sub_gender, transcript.text_clean, topics.text_clean)
data_sub_gender_clean = data_sub_gender_clean %>% mutate(transcript_clean = text, topics_clean = text.1)
names(data_sub_gender_clean)
The following part runs too slowly, so I dropped the bigram idea.
# data_sub_gender_clean_trans2gram <- data_sub_gender_clean %>%
# unnest_tokens(bigram, transcript_clean, token = "ngrams", n = 2) %>% count(talk_id,bigram) %>% spread(bigram, n)
#
#
# # colSums(!is.na(data_sub_gender_clean_trans2gram))
#
#
# data_sub_gender_clean_trans2gram[,2:dim(data_sub_gender_clean_trans2gram)[2]][is.na(data_sub_gender_clean_trans2gram[,2:dim(data_sub_gender_clean_trans2gram)[2]])==0] <- 1
# data_sub_gender_clean_trans2gram[,2:dim(data_sub_gender_clean_trans2gram)[2]][is.na(data_sub_gender_clean_trans2gram[,2:dim(data_sub_gender_clean_trans2gram)[2]])==1] <- 0
#
#
# thresh = dim(data_sub_gender_clean_trans2gram)[1]*0.005
#
# temp = data_sub_gender_clean_trans2gram %>% colSums()
# bigrams_trans = data_sub_gender_clean_trans2gram %>% select(which(temp[2:length(temp)] > thresh)+1)
#
# colnames(bigrams_trans)
Once we have turned a text into a vector, we can apply any methods suitable for the setting. In our case we will use logistic regression models with LASSO to explore the relationship between female and the text.
The following chunk outputs a data frame (later read back as data2), combining the original variables, female, and the tf's of each word.
names(data_sub_gender)
# Combine the original data with the text matrix
# transcript.data1.temp <- data.frame(data_sub_gender, as.matrix(transcript.dtm.05), bigrams_trans)
transcript.data1.temp <- data.frame(data_sub_gender, as.matrix(transcript.dtm.05))
dim(transcript.data1.temp)
names(transcript.data1.temp)[1:30]
#str(data1.temp)
# data2 consists of date, female and all the top 1% words
# data2 <- transcript.data1.temp[, c(1,7, 8,11, 14:ncol(transcript.data1.temp))]
# names(data2)[1:20]
# dim(data2) ### remember we have only run 1000 rows
#### We have previously run the entire 100,000 rows and output the DTM out.
### if not, run and write as csv
if(!file.exists("data/ted_en_tm_freq_transcript.csv")) {
fwrite(transcript.data1.temp, "data/ted_en_tm_freq_transcript.csv", row.names = FALSE)
}
# Combine the original data with the text matrix
topics.data1.temp <- data.frame(data_sub_gender,as.matrix(topics.dtm.05))
dim(topics.data1.temp)
names(topics.data1.temp)[1:30]
#str(data1.temp)
# data2 consists of date, female and all the top 1% words
# data2 <- topics.data1.temp[, c(1,7, 8,11, 14:ncol(topics.data1.temp))]
# names(data2)[1:20]
# dim(data2) ### remember we have only run 1000 rows
#### We have previously run the entire 100,000 rows and output the DTM out.
### if not, run and write as csv
if(!file.exists("data/ted_en_tm_freq_topics.csv")) {
fwrite(topics.data1.temp, "data/ted_en_tm_freq_topics.csv", row.names = FALSE)
}
Let’s first read in the processed data with text being a vector.
transcript.data2 <- fread("data/ted_en_tm_freq_transcript.csv") #dim(data2)
names(transcript.data2)[1:20] # notice that the original variables are in data2
dim(transcript.data2)
transcript.data2$female <- as.factor(transcript.data2$female)
table(transcript.data2$female)
#str(transcript.data2) object.size(transcript.data2) 119Mb
topics.data2 <- fread("data/ted_en_tm_freq_topics.csv") #dim(data2)
names(topics.data2)[1:20] # notice that the original variables are in data2
dim(topics.data2)
topics.data2$female <- as.factor(topics.data2$female)
table(topics.data2$female)
#str(topics.data2) object.size(topics.data2) 49Mb
Following the standard machine learning process, we first split the data into two sets: training data and testing data. We use the training data to build and choose models and make final recommendations; we then report performance on the testing data.
Reserve 10% of the rows, randomly chosen, as our test data (data2.test) and the remaining 90% as the training data (data2.train).
set.seed(911) # for the purpose of reproducibility
transcript.n <- nrow(transcript.data2)
transcript.test.index <- sample(transcript.n, round(0.1*transcript.n))
# length(test.index)
transcript.data2.test <- transcript.data2[transcript.test.index, -c(1:16)] # only keep female and the texts
transcript.data2.train <- transcript.data2[-transcript.test.index, -c(1:16)]
dim(transcript.data2.train)
set.seed(1) # for the purpose of reproducibility
topics.n <- nrow(topics.data2)
topics.test.index <- sample(topics.n, round(0.1*topics.n))
# length(test.index)
topics.data2.test <- topics.data2[topics.test.index, -c(1:16)] # only keep female and the texts
topics.data2.train <- topics.data2[-topics.test.index, -c(1:16)]
dim(topics.data2.train)
We first explore a logistic regression model using LASSO. The following R-chunk runs a LASSO model with \(\alpha=.99\). The reason we take an elastic net is to enjoy the nice properties of both LASSO (imposing sparsity) and Ridge (computational stability).
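For reference, the penalty glmnet adds to the negative log-likelihood for mixing parameter \(\alpha\) is

```latex
\lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
```

so with \(\alpha = .99\) the penalty is almost the pure LASSO \(\ell_1\) penalty, with a small ridge (\(\ell_2\)) component retained for numerical stability.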
LASSO takes a sparse design matrix as input, so make sure to extract the sparse matrix first (e.g., via sparse.model.matrix()) and pass it to cv.glmnet(). It takes about 1 minute to run cv.glmnet() with the sparse matrix versus about 11 minutes using the regular dense design matrix.
### or try `sparse.model.matrix()` which is much faster
y <- transcript.data2.train$female
X1 <- sparse.model.matrix(female~., data=transcript.data2.train)[, -1]
set.seed(2)
transcript.result.lasso <- cv.glmnet(X1, y, alpha=.99, family="binomial")
# 1.25 minutes in my MAC
plot(transcript.result.lasso)
# since this may take a long time to run, we save result.lasso
saveRDS(transcript.result.lasso, file="data/TextMining_lasso_transcript.RDS")
# result.lasso can be assigned back by
# result.lasso <- readRDS("data/TextMining_lasso.RDS")
# number of non-zero words picked up by LASSO when using lambda.1se
coef.1se <- coef(transcript.result.lasso, s="lambda.1se")
lasso.words <- coef.1se@Dimnames[[1]] [coef.1se@i][-1] # non-zero variables without intercept.
summary(lasso.words)
# or our old way
coef.1se <- coef(transcript.result.lasso, s="lambda.1se")
coef.1se <- coef.1se[which(coef.1se !=0),]
lasso.words <- rownames(as.matrix(coef.1se))[-1]
summary(lasso.words)
y <- topics.data2.train$female
X1 <- sparse.model.matrix(female~., data=topics.data2.train)[, -1]
set.seed(2)
topics.result.lasso <- cv.glmnet(X1, y, alpha=.99, family="binomial")
# 1.25 minutes in my MAC
plot(topics.result.lasso)
# since this may take a long time to run, we save result.lasso
saveRDS(topics.result.lasso, file="data/TextMining_lasso_topics.RDS")
# result.lasso can be assigned back by
# result.lasso <- readRDS("data/TextMining_lasso.RDS")
### cv.glmnt with the non-sparse design matrix takes much longer
# X <- as.matrix(data2.train[, -1]) # we can use as.matrix directly her
#### Be careful to run the following LASSO.
#set.seed(2)
#result.lasso <- cv.glmnet(X, y, alpha=.99, family="binomial")
# 10 minutes in my MAC
#plot(result.lasso)
Try to kick out words that are not useful (warning: this may crash your laptop!). Because of the computational burden, we have saved the LASSO results and other results into the TextMining_lasso_*.RDS and TextMining_glm_small_*.RDS files.
We resume our analyses by loading the LASSO results
here. We extract useful variables using lambda.1se
transcript.result.lasso <- readRDS("data/TextMining_lasso_transcript.RDS")
plot(transcript.result.lasso)
coef.1se <- coef(transcript.result.lasso, s="lambda.1se")
coef.1se <- coef.1se[which(coef.1se !=0),]
transcript.lasso.words <- rownames(as.matrix(coef.1se))[-1]
summary(transcript.lasso.words)
##    Length     Class      Mode
##       106 character character
# lasso.words[100:120]
topics.result.lasso <- readRDS("data/TextMining_lasso_topics.RDS")
plot(topics.result.lasso)
coef.1se <- coef(topics.result.lasso, s="lambda.1se")
coef.1se <- coef.1se[which(coef.1se !=0),]
topics.lasso.words <- rownames(as.matrix(coef.1se))[-1]
summary(topics.lasso.words)
##    Length     Class      Mode
##        34 character character
As an alternative model we run a relaxed LASSO: we take the input variables chosen by LASSO and fit a regular logistic regression model on them. Once again the fit is stored as result.glm and saved to an RDS file.
transcript.sel_cols <- c("female", transcript.lasso.words)
# use all_of() to specify we would like to select variables in sel_cols
transcript.data_sub <- transcript.data2.train %>% select(all_of(transcript.sel_cols))
transcript.result.glm <- glm(female~., family=binomial, transcript.data_sub) # takes 3.5 minutes
## glm() returns a big object with unnecessary information
# saveRDS(result.glm,
# file = "data/TextMining_glm.RDS")
## trim the glm() fat from
## https://win-vector.com/2014/05/30/trimming-the-fat-from-glm-models-in-r/
stripGlmLR = function(cm) {
cm$y = c()
cm$model = c()
cm$residuals = c()
cm$fitted.values = c()
cm$effects = c()
cm$qr$qr = c()
cm$linear.predictors = c()
cm$weights = c()
cm$prior.weights = c()
cm$data = c()
cm$family$variance = c()
cm$family$dev.resids = c()
cm$family$aic = c()
cm$family$validmu = c()
cm$family$simulate = c()
attr(cm$terms,".Environment") = c()
attr(cm$formula,".Environment") = c()
cm
}
transcript.result.glm.small <- stripGlmLR(transcript.result.glm)
saveRDS(transcript.result.glm.small,
file = "data/TextMining_glm_small_transcript.RDS")
topics.sel_cols <- c("female", topics.lasso.words)
# use all_of() to specify we would like to select variables in sel_cols
topics.data_sub <- topics.data2.train %>% select(all_of(topics.sel_cols))
topics.result.glm <- glm(female~., family=binomial, topics.data_sub) # takes 3.5 minutes
## glm() returns a big object with unnecessary information
# saveRDS(result.glm,
# file = "data/TextMining_glm.RDS")
topics.result.glm.small <- stripGlmLR(topics.result.glm)
saveRDS(topics.result.glm.small,
        file = "data/TextMining_glm_small_topics.RDS")
The logistic regression model connects the chance of being female to the words in a text. Which words are associated with female speakers, and how much does each one influence the chance of being female? In addition to exploring this set of words, we build word clouds to visualize the words with positive and negative coefficients.
Order the glm positive coefficients (positive words) and show them in a word cloud. The size of each word indicates the strength of the positive association between that word and the chance of the speaker being female.
Order the glm negative coefficients (negative words).
Now we plot the word clouds; the size of each word is proportional to its logistic regression coefficient.
Positive word cloud:
transcript.result.glm <- readRDS("data/TextMining_glm_small_transcript.RDS")
transcript.result.glm.coef <- coef(transcript.result.glm)
transcript.result.glm.coef[200:250]
hist(transcript.result.glm.coef)
# pick out the positive coef's, which are positively related to the prob of the speaker being female
transcript.female.glm <- transcript.result.glm.coef[which(transcript.result.glm.coef > 0)]
# female.glm <- female.glm[-1] # took intercept out
names(transcript.female.glm)[1:20] # which words are positively associated with being female
transcript.female.fre <- sort(transcript.female.glm, decreasing = TRUE) # sort the coef's
round(transcript.female.fre, 4)[1:20] # leading 20 positive words
length(transcript.female.fre) # number of words with positive coefficients
# hist(as.matrix(female.fre), breaks=30, col="red")
transcript.female.word <- names(transcript.female.fre) # positive words in decreasing order of coefficient
topics.result.glm <- readRDS("data/TextMining_glm_small_topics.RDS")
topics.result.glm.coef <- coef(topics.result.glm)
topics.result.glm.coef[200:250]
hist(topics.result.glm.coef)
# pick out the positive coef's, which are positively related to the prob of the speaker being female
topics.female.glm <- topics.result.glm.coef[which(topics.result.glm.coef > 0)]
# female.glm <- female.glm[-1] # took intercept out
names(topics.female.glm)[1:20] # which words are positively associated with female speakers
topics.female.fre <- sort(topics.female.glm, decreasing = TRUE) # sort the coef's
round(topics.female.fre, 4)[1:20] # leading 20 positive words
length(topics.female.fre) # number of words with positive coefficients
# hist(as.matrix(female.fre), breaks=30, col="red")
topics.female.word <- names(topics.female.fre) # positive words in decreasing order of coefficient
The chunk above shows in detail the weights of the positive words. We only show the positive word cloud here. The largest words make sense: we would expect this collection of words to be associated with female speakers.
set.seed(1)
par(mfrow=c(1,2))
cor.special <- brewer.pal(8,"Dark2") # set up a pretty color scheme
wordcloud(transcript.female.word[1:50], transcript.female.fre[1:50], # make a word cloud
colors=cor.special, ordered.colors=F)
wordcloud(topics.female.word[1:20], topics.female.fre[1:20], # make a word cloud
          colors=cor.special, ordered.colors=F)
Concern: many words got trimmed by stemming. We may want to redo the document-term matrix without stemming.
Negative word cloud:
Similarly, we extract the negative coefficients, whose magnitudes are positively correlated with the probability of the speaker being male.
transcript.male.glm <- transcript.result.glm.coef[which(transcript.result.glm.coef < 0)]
transcript.male.glm <- transcript.male.glm[-1] # took intercept out
# names(male.glm)[1:50]
cor.special <- brewer.pal(6,"Dark2")
transcript.male.fre <- sort(-transcript.male.glm, decreasing = TRUE)
round(transcript.male.fre, 4)[1:40]
## applaud phenomena cattl weekend beam diagram command wife
## 1.5747 1.4933 0.9892 0.7666 0.6751 0.5945 0.4745 0.4022
## competit follow draw ted europ man level page
## 0.3334 0.1750 0.1431 0.1385 0.1383 0.1316 0.1268 0.1249
## softwar simpl pictur ill control centuri produc guy
## 0.1188 0.1085 0.1069 0.0870 0.0819 0.0746 0.0726 0.0651
## two ago much second last comput problem will
## 0.0573 0.0560 0.0558 0.0476 0.0461 0.0445 0.0426 0.0365
## someth move done that got let now <NA>
## 0.0352 0.0348 0.0308 0.0134 0.0095 0.0092 0.0025 NA
par(mfrow=c(1,2))
# hist(as.matrix(bad.fre), breaks=30, col="green")
set.seed(1)
transcript.male.word <- names(transcript.male.fre)
wordcloud(transcript.male.word[1:30], transcript.male.fre[1:30],
          colors=cor.special, ordered.colors=F)
topics.male.glm <- topics.result.glm.coef[which(topics.result.glm.coef < 0)]
topics.male.glm <- topics.male.glm[-1] # took intercept out
# names(male.glm)[1:50]
topics.male.fre <- sort(-topics.male.glm, decreasing = TRUE)
round(topics.male.fre, 4)[1:20]
## illus magic militari europ web demo math
## 2.1240 1.9898 1.5608 1.5221 0.7243 0.7002 0.6766
## democraci philosophi invent technolog design human <NA>
## 0.6679 0.6284 0.3089 0.2672 0.2202 0.0486 NA
## <NA> <NA> <NA> <NA> <NA> <NA>
## NA NA NA NA NA NA
# hist(as.matrix(bad.fre), breaks=30, col="green")
set.seed(1)
topics.male.word <- names(topics.male.fre)
wordcloud(topics.male.word, topics.male.fre,
          colors=cor.special, ordered.colors=F)
Put the two clouds (female and male) together:
It seems that, judging from the transcripts, the most distinguishing feature of a male TED talk is the applause it receives: "applaud" is positively related to the speaker being male.
# wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, “Dark2”))
set.seed(1)
par(mfrow=c(1,2))
# cor.special <- brewer.pal(8,"Dark2")
cor.female <- brewer.pal(8, "Paired")
cor.male <- brewer.pal(8, "Blues")
wordcloud(transcript.female.word[1:50], transcript.female.fre[1:50],
colors=cor.female, ordered.colors=F, scale=c(2.2,0.3), max.words=100, random.order=FALSE, rot.per=0.25, use.r.layout=FALSE)
wordcloud(transcript.male.word[1:34], transcript.male.fre[1:34],
          colors="darkblue", ordered.colors=F, scale=c(2.2,0.3), max.words=100, random.order=FALSE, rot.per=0.25, use.r.layout=FALSE)
transcript.female.word
## [1] "reconsid" "irrespons" "nephew" "outweigh" "shini" "uphold"
## [7] "elicit" "refug" "knit" "energet" "boyfriend" "idealist"
## [13] "solidar" "lurk" "braveri" "hesit" "brighter" "mimic"
## [19] "transcend" "jellyfish" "wore" "husband" "clap" "uncov"
## [25] "hubbl" "thrive" "assess" "devast" "overcom" "pregnant"
## [31] "shell" "respond" "reli" "date" "meant" "deserv"
## [37] "resili" "hormon" "past" "telescop" "campaign" "immun"
## [43] "knew" "scientist" "keep" "bias" "loss" "shes"
## [49] "impact" "sex" "studi" "convers" "help" "woman"
## [55] "mom" "word" "bodi" "women" "food" "often"
## [61] "babi" "research" "love" "support" "even" "girl"
## [67] "sexual"
transcript.male.word
## [1] "applaud" "phenomena" "cattl" "weekend" "beam" "diagram"
## [7] "command" "wife" "competit" "follow" "draw" "ted"
## [13] "europ" "man" "level" "page" "softwar" "simpl"
## [19] "pictur" "ill" "control" "centuri" "produc" "guy"
## [25] "two" "ago" "much" "second" "last" "comput"
## [31] "problem" "will" "someth" "move" "done" "that"
## [37] "got" "let" "now"
Judging from the topics, the most predictive topics of female talks are mostly relationship-oriented, whereas those of male talks are mostly object-oriented.
set.seed(1)
par(mfrow=c(1,2))
wordcloud(topics.female.word, topics.female.fre,
colors=cor.female, ordered.colors=F, scale=c(2.5,0.3), max.words=100, random.order=FALSE, rot.per=0.25, use.r.layout=FALSE)
wordcloud(topics.male.word, topics.male.fre,
          colors="darkblue", ordered.colors=F, scale=c(2.5,0.3), max.words=100, random.order=FALSE, rot.per=0.25, use.r.layout=FALSE)
# par(mfrow=c(1,1))
We have obtained two sets of models: one from LASSO, the other from the relaxed LASSO. To compare their performance as classifiers, we evaluate their misclassification errors and/or ROC curves on the testing data.
Using the testing data, we compute misclassification errors under the majority vote rule: classify a talk as female when the estimated probability exceeds 0.5.
transcript.predict.glm <- predict(transcript.result.glm, transcript.data2.test, type = "response")
transcript.class.glm <- ifelse(transcript.predict.glm > .5, "1", "0")
# length(class.glm)
transcript.testerror.glm <- mean(transcript.data2.test$female != transcript.class.glm)
transcript.testerror.glm # misclassification error is 0.36
pROC::roc(transcript.data2.test$female, transcript.predict.glm, plot=T) # AUC=0.67
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
topics.predict.glm <- predict(topics.result.glm, topics.data2.test, type = "response")
topics.class.glm <- ifelse(topics.predict.glm > .5, "1", "0")
# length(class.glm)
topics.testerror.glm <- mean(topics.data2.test$female != topics.class.glm)
topics.testerror.glm # misclassification error is 0.33
pROC::roc(topics.data2.test$female, topics.predict.glm, plot=T) # AUC=0.64
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] 0.3560606
##
## Call:
## roc.default(response = transcript.data2.test$female, predictor = transcript.predict.glm, plot = T)
##
## Data: transcript.predict.glm in 246 controls (transcript.data2.test$female 0) < 150 cases (transcript.data2.test$female 1).
## Area under the curve: 0.6733
## [1] 0.3257576
##
## Call:
## roc.default(response = topics.data2.test$female, predictor = topics.predict.glm, plot = T)
##
## Data: topics.predict.glm in 260 controls (topics.data2.test$female 0) < 136 cases (topics.data2.test$female 1).
## Area under the curve: 0.6355
Once again we evaluate the testing performance of the LASSO solution at lambda.1se.
transcript.predict.lasso.p <- predict(transcript.result.lasso, as.matrix(transcript.data2.test[, -1]), type = "response", s="lambda.1se")
# output lasso estimates of prob's
transcript.predict.lasso <- predict(transcript.result.lasso, as.matrix(transcript.data2.test[, -1]), type = "class", s="lambda.1se")
# output majority vote labels
# LASSO testing errors
mean(transcript.data2.test$female != transcript.predict.lasso) # 0.36
# ROC curve for LASSO estimates
pROC::roc(transcript.data2.test$female, transcript.predict.lasso.p, plot=TRUE) # AUC=0.69
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
topics.predict.lasso.p <- predict(topics.result.lasso, as.matrix(topics.data2.test[, -1]), type = "response", s="lambda.1se")
# output lasso estimates of prob's
topics.predict.lasso <- predict(topics.result.lasso, as.matrix(topics.data2.test[, -1]), type = "class", s="lambda.1se")
# output majority vote labels
# LASSO testing errors
mean(topics.data2.test$female != topics.predict.lasso) # 0.32
# ROC curve for LASSO estimates
pROC::roc(topics.data2.test$female, topics.predict.lasso.p, plot=TRUE) # AUC=0.65
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] 0.3611111
##
## Call:
## roc.default(response = transcript.data2.test$female, predictor = transcript.predict.lasso.p, plot = TRUE)
##
## Data: transcript.predict.lasso.p in 246 controls (transcript.data2.test$female 0) < 150 cases (transcript.data2.test$female 1).
## Area under the curve: 0.6889
## [1] 0.3232323
##
## Call:
## roc.default(response = topics.data2.test$female, predictor = topics.predict.lasso.p, plot = TRUE)
##
## Data: topics.predict.lasso.p in 260 controls (topics.data2.test$female 0) < 136 cases (topics.data2.test$female 1).
## Area under the curve: 0.6467
Comparing the two models through testing errors and ROC curves, we do not see much difference. We could use either final model for prediction.
In this lecture, we applied LASSO to classify the speaker's gender based on the text. The core technique for text mining is a simple bag of words, i.e., a word frequency matrix. The problem then becomes high-dimensional. Using LASSO, we reduce the dimension and train a model with reasonable predictive power. Based on the model, we identify the positive/negative words and build word clouds.
One natural question to ask about bag of words is whether to weight each word, for example by how rare it is across documents or by document length. There are several such adjustments; one is called TF-IDF (term frequency-inverse document frequency). We have tried this and it did not really help in reducing the errors, so we leave it in the Appendices.
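For reference, here is a minimal sketch of TF-IDF weighting with the tm package, using a tiny made-up corpus rather than the TED transcripts; DocumentTermMatrix() accepts the weightTfIdf weighting through its control list.

```r
library(tm)

# Tiny illustrative corpus (not the TED data)
docs <- VCorpus(VectorSource(c("women in science and research",
                               "computer software and design",
                               "women leaders in design")))

# TF-IDF down-weights words that occur in many documents;
# a word appearing in every document gets IDF (hence weight) zero
dtm.tfidf <- DocumentTermMatrix(docs,
                                control = list(weighting = weightTfIdf))
inspect(dtm.tfidf)
```

Note that tm's weightTfIdf normalizes term frequencies by document length by default, which also addresses the document-length concern above.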
There are many other models we could build: for example, trees via random forest or boosting. We could also transform the features by PCA first and then apply the methods mentioned here, or apply deep learning techniques that extend logistic regression. Some of these methods are implemented, and we leave those analyses in the Appendices for reference.
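As one illustration of these alternatives, the sketch below fits a random forest with the randomForest package on simulated count data standing in for a word frequency matrix; the variable names, dimensions, and data here are made up for illustration, not the TED data.

```r
library(randomForest)

set.seed(1)
# Simulated stand-in for a word-frequency matrix: 300 "documents", 10 "words"
X <- matrix(rpois(300 * 10, lambda = 1), nrow = 300,
            dimnames = list(NULL, paste0("word", 1:10)))
toy <- data.frame(female = factor(rbinom(300, 1, 0.4)), X)

fit.rf <- randomForest(female ~ ., data = toy, ntree = 200)
fit.rf$confusion   # out-of-bag confusion matrix as a quick sanity check
importance(fit.rf) # which "words" the forest leans on
```

On the real data one would fit the forest to the training word-frequency matrix and compare its testing error and AUC against the LASSO and relaxed LASSO results above.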